First of all, we will clean the workspace and load all the necessary libraries:
rm(list=ls())
library(naniar)
library(mice)
library(tidyverse)
library(factoextra)
library(GGally)
library(xfun)
library(rmarkdown)
library(dplyr)
library(ggplot2)
library(cluster)
library(dendextend)
library(psych)
library(tidyr)
library(kernlab)
library(mclust)
library(NbClust)
library(Amelia)
library(corrplot)
library(VIM)

The dataset we are going to use for this project is called HR Analytics: Job Change of Data Scientists. A company wants to hire data scientists, so it stores data on the most promising profiles in order to reduce training time and costs in the long run. The dataset is designed to understand which factors lead a person to leave their current job; the company wants to hire the candidate most likely to stay, based on the stored data. The data can be downloaded here: click here
For this particular project, we are going to use the train dataset, since I consider it more valuable and richer in variables than the other one (it includes, for instance, the target column).
setwd("~/UC3M/2nd year/statistical learning/archive")
data = read.csv("aug_train.csv", header = T, na.strings = "")
View(data)
head(data)

## enrollee_id city city_development_index gender relevent_experience
## 1 8949 city_103 0.920 Male Has relevent experience
## 2 29725 city_40 0.776 Male No relevent experience
## 3 11561 city_21 0.624 <NA> No relevent experience
## 4 33241 city_115 0.789 <NA> No relevent experience
## 5 666 city_162 0.767 Male Has relevent experience
## 6 21651 city_176 0.764 <NA> Has relevent experience
## enrolled_university education_level major_discipline experience company_size
## 1 no_enrollment Graduate STEM >20 <NA>
## 2 no_enrollment Graduate STEM 15 50-99
## 3 Full time course Graduate STEM 5 <NA>
## 4 <NA> Graduate Business Degree <1 <NA>
## 5 no_enrollment Masters STEM >20 50-99
## 6 Part time course Graduate STEM 11 <NA>
## company_type last_new_job training_hours target
## 1 <NA> 1 36 1
## 2 Pvt Ltd >4 47 0
## 3 <NA> never 83 0
## 4 Pvt Ltd never 52 1
## 5 Funded Startup 4 8 0
## 6 <NA> 1 24 1
The features in the database are the following (they are mostly self-explanatory):

* enrollee_id
* city
* city_development_index ->> the development index of the city where they are from
* gender
* relevent_experience
* enrolled_university
* education_level
* major_discipline
* experience
* company_size
* company_type
* last_new_job
* training_hours
* target ->> (0) not looking for a job change, (1) looking for a job change
Now, let's dive into the dataset.
str(data)

## 'data.frame': 19158 obs. of 14 variables:
## $ enrollee_id : int 8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
## $ city : chr "city_103" "city_40" "city_21" "city_115" ...
## $ city_development_index: num 0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
## $ gender : chr "Male" "Male" NA NA ...
## $ relevent_experience : chr "Has relevent experience" "No relevent experience" "No relevent experience" "No relevent experience" ...
## $ enrolled_university : chr "no_enrollment" "no_enrollment" "Full time course" NA ...
## $ education_level : chr "Graduate" "Graduate" "Graduate" "Graduate" ...
## $ major_discipline : chr "STEM" "STEM" "STEM" "Business Degree" ...
## $ experience : chr ">20" "15" "5" "<1" ...
## $ company_size : chr NA "50-99" NA NA ...
## $ company_type : chr NA "Pvt Ltd" NA "Pvt Ltd" ...
## $ last_new_job : chr "1" ">4" "never" "never" ...
## $ training_hours : int 36 47 83 52 8 24 24 18 46 123 ...
## $ target : num 1 0 0 1 0 1 0 1 1 0 ...
We are going to change most of the column types to factor so we can work with them more easily.
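A compact alternative to the column-by-column conversions below is dplyr's across() (loaded via tidyverse). This is only a sketch; the result is the same as the explicit calls that follow.

```r
# Convert every character column to factor, and target to factor, in one step
data = data %>%
  mutate(across(where(is.character), as.factor),
         target = as.factor(target))
```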
data$city = as.factor(data$city)
data$gender = as.factor(data$gender)
data$relevent_experience = as.factor(data$relevent_experience)
data$enrolled_university = as.factor(data$enrolled_university)
data$education_level = as.factor(data$education_level)
data$major_discipline = as.factor(data$major_discipline)
data$experience = as.factor(data$experience)
data$company_size = as.factor(data$company_size)
data$last_new_job = as.factor(data$last_new_job)
data$training_hours = as.integer(data$training_hours)
data$target = as.factor(data$target)

There are some values in the company_size variable that are not in the right format, so let's change them. Note that gsub() returns a character vector, so we convert back to factor at the same time:

data$company_size = as.factor(gsub("/", "-", data$company_size))

Let's count our NAs and plot them so we can inspect them and decide how to fix them.

sum(is.na(data))

## [1] 20733
We have a lot of missing values! Let's see them graphically.
hist(rowMeans(is.na(data)))

barplot(colMeans(is.na(data)), las=2)

aggr(data, numbers = TRUE, sortVars = TRUE, labels = names(data),
     cex.axis = .7, gap = 1, ylab = c('Missing data','Pattern'))

##
## Variables sorted by number of missings:
## Variable Count
## company_type 0.320492745
## company_size 0.309948846
## gender 0.235306399
## major_discipline 0.146831611
## education_level 0.024010857
## last_new_job 0.022079549
## enrolled_university 0.020148241
## experience 0.003392839
## enrollee_id 0.000000000
## city 0.000000000
## city_development_index 0.000000000
## relevent_experience 0.000000000
## training_hours 0.000000000
## target 0.000000000
We confirmed what we already knew: there are a lot of missing values across many different variables!
gg_miss_upset(data)

The upset plot shows the combinations of missing values, by default for the 5 variables with the most missings, and orders the combinations by size. The largest intersections involve company_size and company_type (the variables with the most NAs), whereas education_level appears in the smallest ones, which makes sense. Let's see exactly how many NAs are in each variable and plot it.
gender = sum(is.na(data$gender))
en_uni = sum(is.na(data$enrolled_university))
ed_lvl = sum(is.na(data$education_level))
m_disc = sum(is.na(data$major_discipline))
exp = sum(is.na(data$experience))
com__size = sum(is.na(data$company_size))
com_type = sum(is.na(data$company_type))
job = sum(is.na(data$last_new_job))

null_df = data.frame(variable = c("enrollee_id", "city", "gender",
  "city_development_index", "relevent_experience", "enrolled_university", "education_level",
  "major_discipline", "experience", "company_size", "company_type", "last_new_job",
  "training_hours", "target"),
  total_null = c(0, 0, 4508, 0, 0, 386, 460, 2813, 65, 5938, 6140, 423, 0, 0))
null_df

## variable total_null
## 1 enrollee_id 0
## 2 city 0
## 3 gender 4508
## 4 city_development_index 0
## 5 relevent_experience 0
## 6 enrolled_university 386
## 7 education_level 460
## 8 major_discipline 2813
## 9 experience 65
## 10 company_size 5938
## 11 company_type 6140
## 12 last_new_job 423
## 13 training_hours 0
## 14 target 0
null_df$variable = factor(null_df$variable,
levels = null_df$variable[order(null_df$total_null,decreasing = TRUE)])
ggplot(null_df, aes(x = variable, y = total_null)) +
  geom_bar(stat = "identity", col = "#99FF99", fill = "#CCFF99") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  geom_label(aes(label = total_null, size = NULL), nudge_y = 0.6) +
  theme(plot.title = element_text(hjust = 0.6)) +
  ggtitle("Total NAs by col") +
  xlab("Variable")

A missmap also shows about 12% missing observations. Let's fix it by replacing the NAs with the mean for numeric variables and with the most frequent value for categorical ones. Otherwise, we would lose a lot of information and valuable data before even starting the unsupervised learning. (Dropping or multiply imputing them was also time and power consuming.)
data$city_development_index[which(is.na(data$city_development_index))] = mean(data$city_development_index, na.rm = TRUE)
data$gender[which(is.na(data$gender))] = "Male"
data$company_size[which(is.na(data$company_size))] = "50-99"
data$relevent_experience[which(is.na(data$relevent_experience))] = "No relevent experience"
data$enrolled_university[which(is.na(data$enrolled_university))] = "no_enrollment"
data$education_level[which(is.na(data$education_level))] = "Graduate"
data$major_discipline[which(is.na(data$major_discipline))] = "STEM"
data$experience[which(is.na(data$experience))] = ">20"
data$last_new_job[which(is.na(data$last_new_job))] = 1
data$training_hours[which(is.na(data$training_hours))] = mean(data$training_hours, na.rm = TRUE)
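The hard-coded replacement values above ("Male", "50-99", "Graduate", …) are simply the most frequent level of each variable. A small helper makes the same idea generic; this is a sketch, and impute_mode is a name introduced here, not from the original code.

```r
# Returns x with NAs replaced by its most frequent value (the mode)
impute_mode = function(x) {
  mode_val = names(which.max(table(x)))
  x[is.na(x)] = mode_val
  x
}

# e.g. data$gender = impute_mode(data$gender) gives the same result
# as the manual assignment above
```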
sum(data$target == 0)

## [1] 14381

sum(data$target == 1)

## [1] 4777

data$target[which(is.na(data$target))] = 0

Now, we are going to remove those variables which we consider not relevant: city, enrollee_id and company_type. We check that there are no NAs left.

data = select(data, -enrollee_id, -city, -company_type)
sum(is.na(data))

## [1] 0
Let's get some insight into what is going on in our data. First of all, we can see that there are far more men than any other gender (about 90%).
sum(data$gender == "Female")

## [1] 1238

sum(data$gender == "Male")

## [1] 17729

sum(data$gender == "Other")

## [1] 191
ggplot(data, aes(x=major_discipline))+
geom_bar(fill = "#58A4B0")+
geom_text(stat='count', aes(label=..count..), vjust=-1)+
ggtitle("Major Distribution")+
theme(plot.title = element_text(hjust = 0.5))+
  xlab("Major")

We have far more people in STEM than in any other major, as seen above (90.32%).
I also made a plot of majors divided by gender, but it was not informative: since our data is mostly men, anything we split by gender will simply show more men.
ggplot(data,aes(x=relevent_experience))+
geom_bar(fill = "#373F51")+
facet_wrap(~education_level)+
ggtitle("Relevent experience by education level")+
theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
theme(plot.title = element_text(hjust = 0.5))+
xlab("Experience")+
  ylab("Count")

We can see that those with relevant experience are mostly university graduates, followed by those with a master's degree.
ggplot(data, aes(x=major_discipline == "STEM"))+
geom_bar(fill = "#58A4B0")+
geom_text(stat='count', aes(label=..count..), vjust=-1)+
ggtitle("Major Distribution")+
facet_wrap(~enrolled_university)+
theme(plot.title = element_text(hjust = 0.5))+
  xlab("Enrollment in STEM")

We know that 73% of our candidates were not enrolled in university, so it adds up that most of them were not enrolled in STEM. Among those who were enrolled, the full time course is more common than the part time one.
ggplot(data, aes(x = training_hours, fill = relevent_experience)) +
  geom_density(alpha = 0.5)

ggplot(data, aes(x = training_hours, fill = relevent_experience)) +
  geom_density(alpha = 0.5) + facet_wrap(~relevent_experience)

ggplot(data, aes(x = training_hours, fill = relevent_experience)) +
  geom_density(alpha = 0.5) + xlim(10, 70)

The completed training hours are basically the same whether or not the candidate has relevant experience, and the bulk of the distribution lies around 20 to 30 hours.
ggplot(data, aes(x=city_development_index, fill=relevent_experience)) +
geom_density(adjust=1.5, alpha=.4) +
  theme_light()

Those with relevant experience mostly come from cities with a higher development index, although many without it do too. There is also a small peak of candidates with relevant experience around a 0.6 development index, but most candidates come from cities with an index around 0.9.
ggplot(data, aes(x=training_hours, group=education_level, fill=education_level)) +
geom_density(adjust=1.5, alpha = 0.5) +
  theme_bw() + facet_wrap(~education_level)

Just like with relevant experience, training hours look independent of the candidate's education level; the distributions are mostly equal.
ggplot(data,aes(x=education_level,fill = education_level))+
geom_bar()+
facet_wrap(~target)+
geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
ggtitle("Target by education level")+
xlab("Target")+
  ylab("Count")

The people most likely to want to leave their current job are graduates, but they are also the most likely to want to stay, simply because they are the largest group. Let's check it by years of experience.
ggplot(data,aes(x=experience,fill = experience))+
geom_bar()+
facet_wrap(~target)+
geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
ggtitle("Target by years of experience")+
xlab("Target")+
  ylab("Count")

Interestingly, the candidates most likely to want to change their current job are those with more than 20 years of experience and those with 4 to 6 years of experience.
Let's change most of our variables to numeric first:
data$relevent_experience = as.numeric(data$relevent_experience)
data$gender = as.numeric(data$gender)
data$experience = as.numeric(data$experience)
data$last_new_job = as.numeric(data$last_new_job)
data$training_hours = as.numeric(data$training_hours)
data$target = as.numeric(data$target)
data$education_level = as.numeric(data$education_level)
data$major_discipline = as.numeric(data$major_discipline)
data$enrolled_university = as.numeric(data$enrolled_university)
data$company_size = as.factor(data$company_size)
data$company_size = as.numeric(data$company_size)
corrplot(cor(data), is.corr = F,
         number.cex = 0.4, tl.cex = 0.4, cl.cex = 0.4)

ggcorr(data, label = T)

We do not have really strong correlations between variables, but let's keep going anyway. (Note that city was already removed earlier, so we do not need to drop it again here.)
pca1 = prcomp(data, scale=T)
summary(pca1)

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3516 1.1289 1.04181 1.01213 0.99899 0.98635 0.94746
## Proportion of Variance 0.1661 0.1159 0.09867 0.09313 0.09073 0.08844 0.08161
## Cumulative Proportion 0.1661 0.2819 0.38060 0.47373 0.56445 0.65290 0.73450
## PC8 PC9 PC10 PC11
## Standard deviation 0.93341 0.89518 0.81149 0.76768
## Proportion of Variance 0.07921 0.07285 0.05987 0.05358
## Cumulative Proportion 0.81371 0.88656 0.94642 1.00000
We obtain a table with all the components, but let's plot it to get a better visual insight.
screeplot(pca1, main = "Screeplot", col = "#aC5E99", type = "barplot", pch = 19)

But it is nicer and much more informative if we use factoextra:

fviz_screeplot(pca1, addlabels = TRUE)

We see that we need most of the components to explain the variance, because the first one alone barely explains 17%.
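The same point can be checked numerically; the following is a quick sketch of the cumulative proportion of variance computed from the pca1 object above.

```r
# Proportion of variance per component and its cumulative sum
prop_var = pca1$sdev^2 / sum(pca1$sdev^2)
cum_var = cumsum(prop_var)
round(cum_var, 3)
# First component reaching 80% of the variance (PC8, per the summary table)
which(cum_var >= 0.80)[1]
```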
fviz_pca_ind(pca1, geom.ind = "point", col.ind = "#aC5E99",
             axes = c(1, 2), pointsize = 1.5, repel = T)

What would have happened if we had also centered the data explicitly? (prcomp() centers by default, so we expect no difference.)

pca2 = prcomp(data, center = T, scale. = T); summary(pca2)

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3516 1.1289 1.04181 1.01213 0.99899 0.98635 0.94746
## Proportion of Variance 0.1661 0.1159 0.09867 0.09313 0.09073 0.08844 0.08161
## Cumulative Proportion 0.1661 0.2819 0.38060 0.47373 0.56445 0.65290 0.73450
## PC8 PC9 PC10 PC11
## Standard deviation 0.93341 0.89518 0.81149 0.76768
## Proportion of Variance 0.07921 0.07285 0.05987 0.05358
## Cumulative Proportion 0.81371 0.88656 0.94642 1.00000
fviz_screeplot(pca2, addlabels = TRUE)

fviz_pca_var(pca2, col.var = "cos2", geom.var = "arrow",
             labelsize = 2, repel = T)

barplot(pca1$rotation[,1], las = 2, col = "#aC5E99")

barplot(pca2$rotation[,1], las = 2, col = "#58A4B0")

Both PCAs are identical, and the same variance is explained either way.

fviz_contrib(pca1, choice = "var", axes = 2)

Here we can see each variable's contribution to the second component, with city_development_index contributing the most, followed by education_level and relevent_experience.
head(get_pca_ind(pca1)$contrib[,1])

## 1 2 3 4 5 6
## 3.777108e-04 2.202977e-04 2.923932e-02 5.666895e-03 3.247206e-05 9.915507e-04
The individual contribution indices are really low, which is expected with almost 19,000 observations.
Factor Analysis and PCA are pretty similar: both describe the variability in our data. FA is a technique used to reduce a large number of variables to a smaller number of factors; the biggest difference is that FA assumes normality and linearity. Let's use 3 factors as a default.
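Choosing 3 factors is a default rather than a data-driven choice. The psych package (already loaded) offers parallel analysis to suggest a number of factors; a sketch is below, though it can take a while on ~19,000 rows.

```r
# Parallel analysis: compares the observed eigenvalues against those from
# random data and suggests how many factors to retain
fa.parallel(data, fa = "fa")
```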
factosys = factanal(data, factors = 3, rotation = "none", scores = "regression")
factosys

##
## Call:
## factanal(x = data, factors = 3, scores = "regression", rotation = "none")
##
## Uniquenesses:
## city_development_index gender relevent_experience
## 0.685 0.998 0.390
## enrolled_university education_level major_discipline
## 0.834 0.973 0.948
## experience company_size last_new_job
## 0.860 0.931 0.832
## training_hours target
## 0.999 0.005
##
## Loadings:
## Factor1 Factor2 Factor3
## city_development_index -0.343 0.437
## gender
## relevent_experience 0.130 0.762 0.111
## enrolled_university -0.122 -0.368 0.124
## education_level 0.118
## major_discipline -0.226
## experience 0.168 -0.327
## company_size 0.102 0.229
## last_new_job 0.379 -0.151
## training_hours
## target 0.997
##
## Factor1 Factor2 Factor3
## SS loadings 1.170 0.954 0.422
## Proportion Var 0.106 0.087 0.038
## Cumulative Var 0.106 0.193 0.231
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 381.76 on 25 degrees of freedom.
## The p-value is 1.67e-65
cbind(factosys$loadings, factosys$uniquenesses)

## Factor1 Factor2 Factor3
## city_development_index -0.343198908 -0.080846443 0.436506244 0.6851434
## gender -0.006668926 0.012987558 -0.046550120 0.9976168
## relevent_experience 0.130161918 0.762190626 0.111095547 0.3897815
## enrolled_university -0.121909998 -0.368158317 0.124301108 0.8341446
## education_level -0.084114978 0.077107642 0.118282651 0.9729891
## major_discipline 0.013587355 0.030616715 -0.225659161 0.9479514
## experience 0.070105669 0.167865929 -0.326775589 0.8601227
## company_size 0.101791249 0.228602392 0.082825361 0.9305236
## last_new_job 0.044875196 0.378766910 -0.150702098 0.8318011
## training_hours -0.021626402 -0.009556137 -0.018994553 0.9990832
## target 0.997494132 -0.002014260 0.001193804 0.0050000
The first factor loads almost entirely (0.997) on target, the second factor mostly captures relevent_experience, and the third is mostly related to city_development_index.
factosys2 = factanal(data, factors = 3, rotation = "varimax", scores = "regression")
factosys2

##
## Call:
## factanal(x = data, factors = 3, scores = "regression", rotation = "varimax")
##
## Uniquenesses:
## city_development_index gender relevent_experience
## 0.685 0.998 0.390
## enrolled_university education_level major_discipline
## 0.834 0.973 0.948
## experience company_size last_new_job
## 0.860 0.931 0.832
## training_hours target
## 0.999 0.005
##
## Loadings:
## Factor1 Factor2 Factor3
## city_development_index -0.283 -0.480
## gender
## relevent_experience 0.780
## enrolled_university -0.362 -0.177
## education_level -0.118
## major_discipline 0.228
## experience 0.134 0.349
## company_size 0.247
## last_new_job 0.360 0.195
## training_hours
## target 0.984 0.116 0.115
##
## Factor1 Factor2 Factor3
## SS loadings 1.068 0.973 0.505
## Proportion Var 0.097 0.088 0.046
## Cumulative Var 0.097 0.186 0.231
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 381.76 on 25 degrees of freedom.
## The p-value is 1.67e-65
cbind(factosys2$loadings, factosys2$uniquenesses)

## Factor1 Factor2 Factor3
## city_development_index -0.28326449 -0.066857440 -0.47974019 0.6851434
## gender -0.01303137 0.006317856 0.04658683 0.9976168
## relevent_experience 0.03988904 0.780056966 -0.01178520 0.3897815
## enrolled_university -0.05931015 -0.362004784 -0.17688471 0.8341446
## education_level -0.08096397 0.080505905 -0.11821733 0.9729891
## major_discipline -0.01365783 0.004182644 0.22768405 0.9479514
## experience 0.01377505 0.133748076 0.34899536 0.8601227
## company_size 0.07890051 0.247431743 -0.04508652 0.9305236
## last_new_job -0.02074055 0.360125061 0.19511291 0.8318011
## training_hours -0.02201464 -0.014295825 0.01519208 0.9990832
## target 0.98402823 0.116063458 0.11496846 0.0050000
barplot(factosys$loadings[,1], names = F, las = 2, col = "#6633CC", ylim = c(-1, 1))

barplot(factosys$loadings[,2], names = F, las = 2, col = "#6633CC", ylim = c(-1, 1))

barplot(factosys$loadings[,3], las = 2, col = "#6633CC", ylim = c(-1, 1))

barplot(factosys2$loadings[,1], names = F, las = 2, col = "#FF9966", ylim = c(-1, 1))

barplot(factosys2$loadings[,2], names = F, las = 2, col = "#FF9966", ylim = c(-1, 1))

barplot(factosys2$loadings[,3], las = 2, col = "#FF9966", ylim = c(-1, 1))

Clustering consists of grouping the objects of our dataset into similar groups. We will start with kmeans with 5 centers by default.
data_scaled= scale(data)
k1 = kmeans(data_scaled, centers = 5, nstart = 100)
k1_cluster = k1$cluster
k1_centers = k1$centers
#representation
barplot(table(k1_cluster), col = "#9999CC", xlab = "clusters")

Now, let's interpret the centers:
bar1 = barplot(k1_centers[1,], las=2, col="#9999CC",ylim = c(-2,2),
main= paste("Cluster 1: center (blue) and global center (pink)"))
points(bar1, y = apply(data_scaled, 2, quantile, 0.50), col = "#E85D75", pch = 19)

Now, we plot the second center.
bar2 = barplot(k1_centers[2,], las = 2, col = "#9999CC", ylim = c(-2, 2),
  main = paste("Cluster 2: center (blue) and global center (pink)"))
points(bar2, y = apply(data_scaled, 2, quantile, 0.50), col = "#E85D75", pch = 19)

fviz_cluster(k1, data = data_scaled, geom = c("point"), ellipse.type = "norm") +
  theme_minimal()

With the silhouette method we can check the optimal number of centers.
#fviz_nbclust(data_scaled, kmeans, method = 'silhouette')
#fviz_nbclust(data_scaled, kmeans, method = 'wss')

I also tried to plot this with the silhouette function, but it needs more computing power than my laptop has, so I was not able to run it on the full data. Let us assume that the ideal number is 3.
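A workaround for the memory problem is to run the search on a random subsample; the suggested k on a few thousand rows is usually a good guide for the full dataset. This is a sketch, and the sample size of 2000 is an arbitrary choice.

```r
set.seed(1)  # for a reproducible subsample
idx = sample(nrow(data_scaled), 2000)
fviz_nbclust(data_scaled[idx, ], kmeans, method = "silhouette")
fviz_nbclust(data_scaled[idx, ], kmeans, method = "wss")
```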
k2 = kmeans(data_scaled, centers = 3, nstart = 100)
k2_clusters = k2$cluster
k2_centers = k2$centers
barplot(table(k2_clusters), col = "#9999CC", xlab = "clusters")

b2 = barplot(k2_centers[3,], las = 2, col = "#FFC15E", ylim = c(-2, 2),
  main = "Cluster 3: center (orange) and global center (red)")
points(b2, y = apply(data_scaled, 2, quantile, 0.5), col = "#BF4E30", pch = 20)

fviz_cluster(k2, data = data_scaled, geom = c("point"), ellipse.type = "norm") +
  theme_minimal()

If we use this other plot, we have less overlapping:
#k2_min2 = eclust(data_scaled, "kmeans", stand = T, k = 3, graph = T, hc_metric = "minkowski")

Either way, let's try with 2 clusters:

#k2_min2 = eclust(data_scaled, "kmeans", stand = T, k = 2, graph = T, hc_metric = "minkowski")

These lines error when the document is knitted because of the memory they require, but they did run on my computer, and with 2 clusters the overlapping was basically non-existent.
Profiles are variables not included in the clustering that help us understand the clusters better. Let us describe the 3 clusters using 3 variables (company_size, training_hours and education_level):
fit = kmeans(data_scaled, centers=3, nstart=100)
groups = fit$cluster
# We consider first the training_hours
as.data.frame(data_scaled) %>%
  mutate(cluster = factor(groups), company_size = company_size,
         min = training_hours, education_level = education_level) %>%
  ggplot(aes(x = cluster, y = min)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "training hours by cluster", x = "", y = "", col = "")

We can see that all 3 clusters have a lot of outliers. Let's look now at the company size:
as.data.frame(data_scaled) %>%
  mutate(cluster = factor(groups), company_size = company_size,
         min = training_hours, education_level = education_level) %>%
  ggplot(aes(x = cluster, y = company_size)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "company_size by cluster", x = "", y = "", col = "")

The first cluster looks pretty good; the other two have quite tilted medians, so they are not as good.
fviz_cluster(fit, data = data_scaled, geom = c("point"), ellipse.type = "norm") +
  theme_dark()

The last approach we are going to try is based on the Mahalanobis distance, which measures the distance between a point and a distribution and is inherently multidimensional. Let us try it with 3 centers:
fit.mahalanobis = kmeans(data_scaled, centers=3, nstart=100)
groups_mah = fit.mahalanobis$cluster
centers_mah=fit.mahalanobis$centers
colnames(centers_mah)=colnames(data_scaled)
fviz_cluster(fit.mahalanobis, data = data_scaled, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal()

This method is less power consuming and has a neat graphical interpretation (basically the same as the first one); both methods are valid. The only one I would leave out is the Minkowski one.
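As a side note, kmeans always uses Euclidean distance internally; on standardized variables this matches the Mahalanobis distance only when the variables are uncorrelated. A closer approximation is to whiten the data first. This is a sketch; data_white and fit_white are names introduced here, not from the original code.

```r
# Whitening: after multiplying by the inverse Cholesky factor of the
# covariance matrix, Euclidean distance in the new coordinates equals
# Mahalanobis distance in the original ones
S = cov(data_scaled)
data_white = data_scaled %*% solve(chol(S))
fit_white = kmeans(data_white, centers = 3, nstart = 100)
table(fit_white$cluster)
```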
Studying human behavior is always tricky: we are often unpredictable about our desires, and our future is always uncertain. Thus, it is hard to know why people stay at or leave a company based on their background. But it was interesting to see the relations that appear (whether they are coincidences or not).
We have also learned about the different methods that can be applied to a large dataset such as this one, and that some of them are very time and power consuming.